-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-463: Add more types - time, nano timestamps, UUID to Variant spec #464
GH-463: Add more types - time, nano timestamps, UUID to Variant spec #464
Conversation
VariantEncoding.md
Outdated
@@ -391,6 +391,10 @@ The Decimal type contains a scale, but no precision. The implied precision of a | |||
| Float | float | `14` | FLOAT | IEEE little-endian | | |||
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes | | |||
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes | | |||
| TimeNTZ | time without time zone | `21` | TIME(false, MICROS) | 8-byte little-endian | | |||
| Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that iceberg has added timestamp_ns
in the V3 spec. From the Parquet side, we'd better be more generic. Should we consider parameterized timestamp type like what Trino does for Variant type: https://trino.io/docs/current/language/types.html#timestamp-p?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much agree here. This is also in line with the LogicalTypeAnnotation in parquet-java
.
| Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian | | |
| Timestamp | timestamp | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian | |
Since this hasn't been published yet, I would also propose:
- Group these with the other timestamp/time types.
- Remove the NTZ, as it is similar to the timeunit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing. I will try to make the name change to align with Parquet annotation The only thing is that we need to have separate type ID for ltz/ntz and micro/nano seconds. so it would be cleaner to have them into separate entries rather than grouping them with parameter.
VariantEncoding.md
Outdated
@@ -386,11 +386,15 @@ The Decimal type contains a scale, but no precision. The implied precision of a | |||
| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | | |||
| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | | |||
| Date | date | `11` | DATE | 4 byte little-endian | | |||
| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian | | |||
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian | | |||
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume some of these changes were weren't' intended, these lines shouldn't be changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH nvm, I see what's going on here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually maybe I don't, "timestamp" isn't a physical type in parquet correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no Date or timestamp for physical type. I'm wondering if this column should mean type category
. @gene-db ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should just eliminate the first column (and annotate logical type with additional information like exact numeric).
VariantEncoding.md
Outdated
| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian | | ||
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian | | ||
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian | | ||
| Timestamp | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are trying to correct this it seems like timetamp without time zone should be the logical type and the physical type is int64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this duplicates the the question @RussellSpitzer asked abve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is answered in the paragraph below. Although "behaves" the same is ambiguous. I'm not sure if the intent here Is if parquet has flexibility to change the physical type. We should make sure we clarify this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Just saw the explanation below for logical type.
I think it means: for a string, the engines can choose to encode differently but when it's read, they should be the same.
I updated the logical types for newly added types. Since engines may choose different precisions to encode, they have the same logical type from the paragraph below.
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
VariantEncoding.md
Outdated
| Float | float | `14` | FLOAT | IEEE little-endian | | ||
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes | | ||
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes | | ||
| Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | | ||
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think logical type should indicate precision here (and aboe for the other timestamps)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually reading the paragraph below I guess we don't need precision but do need nts
VariantEncoding.md
Outdated
| Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | | ||
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian | | ||
| Timestamp | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian | | ||
| UUID | uuid | `24` | UUID | 16 bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably note there the encoding order (even though it is duplicative)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added 16-byte big-endian.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think once we iron out wording on timestamp we can merge.
45420e7
to
de25a28
Compare
This LGTM, @RussellSpitzer any more comments. |
Will merge end of week if there aren't more comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me @emkornfield.
@aihuaxu sorry two small suggestions to avoid overloading "Logical Type" which is already a separate concept in Parquet. |
Yeah. Totally agree that we should avoid using logical type. I was thinking of using "type category" or "type group" before to avoid overload. But "Type Equivalence Class" also works for me. |
Co-authored-by: emkornfield <[email protected]>
Co-authored-by: emkornfield <[email protected]>
Going to merge. Thanks @aihuaxu |
Rationale for this change
Iceberg tables support time, nano timestamps, UUID types and currently Variant spec doesn't include those. Propose to add those missing types.
What changes are included in this PR?
The spec change.
Fix #463